This code seeks to explain how do organoids from patients reflect the generality of OV patients. We do clustering of patients and organoids and see where organoids fall – based on signatures, and also based on raw CN profiles.

1 Considerations

1.0.1 To do

  • Create the new britroc OV exposures using the new code that Ruben has provided and the new absolute copy number segments that he has provided too.

1.0.2 Changes in signatures extraction (from Ruben)

  • removing a big from one the features: the first segment was not counted, whih is not too important for OV
  • the pre-processing of CN segments (only applicable to SNP array)

The previous data are:

  • 132 patients (BriTROC-1) using low-cost shallow whole-genome sequencing (sWGS; 0.1×)
  • 112 dWGS HGSOC cases from the Pan-Cancer Analysis of Whole Genomes (PCAWG)
    • OV-US and OV-AU are both ICGC
  • 415 HGSOC cases with SNP array and whole-exome sequencing data from The Cancer Genome Atlas (TCGA)

(total of 659 samples)

BriTROC: there are the original BriTROC segments (from NatGen paper) and new BriTROC segments (called BriTROC 2, here, but it’s not the BriTROC-2 cohort, it’s still the BriTROC-1 dataset, made by Ruben).

## Warning: package 'BiocGenerics' was built under R version 4.0.5
## Warning: package 'GenomeInfoDb' was built under R version 4.0.5

2 Loading organoids data

org<- as(readRDS("data/organoid_exposures.rds"), 'matrix')
# rownames(org) <- paste0('Sample ', 1:nrow(org))
# names_orgs = readxl::read_xlsx("data/NewOrganoidNaming.xlsx")
names_orgs = read_csv("data/NewOrganoidNaming.csv")
## Rows: 18 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): new name
## dbl (1): old name
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names_orgs$`new name`[match(rownames(org), paste0(names_orgs$`old name`, 'org'))]
##  [1] "PDO16" "PDO15" "PDO3"  "PDO9"  "PDO5"  "PDO6"  "PDO10" "PDO1"  "PDO12"
## [10] "PDO4"  "PDO2"  "PDO18" "PDO7"  "PDO17" "PDO8"  "PDO13" "PDO14" "PDO11"
rownames(org) = names_orgs$`new name`[match(rownames(org), paste0(names_orgs$`old name`, 'org'))]

2.1 Data from organoids

rename_rows = function(i, new_names){
  rownames(i) = new_names; return(i)}
## Creating plot... it might take some time if the data are large. Number of samples: 18

## Creating plot... it might take some time if the data are large. Number of samples: 18

Apparently only PDO1 is a WGD sample. I see which log-ratio with any other signature shows a clear difference between this WGD sample and the rest. The ratio between s4 (WGD signature) and s2 is.

##   Var1 Var2      value group
## 1 PDO1    1  0.2064604   WGD
## 2 PDO1    2 13.6860836   WGD
## 3 PDO1    3  3.9450050   WGD
## 4 PDO1    4  1.0000000   WGD
## 5 PDO1    5        Inf   WGD
## 6 PDO1    6  3.3746386   WGD
## 7 PDO1    7        Inf   WGD
## Warning: Removed 11 rows containing missing values (geom_point).

apply(org, 2, function(j) table(factor(j==0, levels=c(T,F))))
##       s1 s2 s3 s4 s5 s6 s7
## TRUE   1  0  0  6  4  4  2
## FALSE 17 18 18 12 14 14 16
table(apply(org[,-5], 1, function(j) paste0(j==0,collapse='-')))
## 
## FALSE-FALSE-FALSE-FALSE-FALSE-FALSE  FALSE-FALSE-FALSE-FALSE-FALSE-TRUE 
##                                   9                                   2 
##  FALSE-FALSE-FALSE-TRUE-FALSE-FALSE   FALSE-FALSE-FALSE-TRUE-TRUE-FALSE 
##                                   3                                   3 
##   TRUE-FALSE-FALSE-FALSE-TRUE-FALSE 
##                                   1

2.1.1 Data from Nature Genetics 2018 paper

We are loading both the original signatures, and the updated signatures.

2.1.2 Number of zeros in exposures

We have two dataframes: with the previous TCGA samples and with the current ones. Both contain the BriTROC and ICGC to this as well (which are shared).

## The percentage of zeros in each cohort is:
## $organoids
## $organoids[[1]]
## [1] "13.492%"
## 
## 
## $ExposuresNatGen
## $ExposuresNatGen$britroc
## [1] "0%"
## 
## $ExposuresNatGen$`OV-AU`
## [1] "0%"
## 
## $ExposuresNatGen$`OV-US`
## [1] "0%"
## 
## $ExposuresNatGen$TCGA
## [1] "0%"
## 
## 
## $UpdatedExposures
## $UpdatedExposures$britroc
## [1] "0%"
## 
## $UpdatedExposures$`OV-AU`
## [1] "0%"
## 
## $UpdatedExposures$`OV-US`
## [1] "0%"
## 
## $UpdatedExposures$`Updated TCGA`
## [1] "24.305%"

This makes the organoids and the TCGA exposures sample, and leaves the other in the periphery of the PCA. I suspect this is due to the number of zero exposures, which are imputated using the robust analyses that I am using here:

  • The number of organoids is 18
  • The number of TCGA samples in the previous (published) cohort was 374.
  • The number of TCGA samples in the current (Ruben’s) cohort is 529.
  • The number of TCGA samples found in the previous cohort but not in the current is 0.
  • The number of TCGA samples found in the current cohort but not in the previous is 374.

We are only selecting the updated exposures, now

which_natgens = c('UpdatedExposures')
which_natgen = 'UpdatedExposures'

3 PCA

For compositional data, in the book Analysing compositional data with R they say that PCA should be done on clr-transformed data. Zeroes are an issue if we use clr using all samples. The robust clr is implemented in the package compositions and deals with this problem by doing the geometric mean over only non-zero values, and setting the clr of a part which is zero to zero.

The plot done with (biplot(princomp(acomp(x)))) is the same as plotting princomp(as(clr(x), ‘matrix’))

3.1 Creating a PCA with the data from the clinical cohorts, and projecting the organoids

## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Saving 7 x 5 in image
## Saving 7 x 5 in image

4 Dendrograms

PCA of a uniform sample from the 3-dimensional simplex (i.e. four parts), with three different transformations

PCA of organoids with three variations

## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Plot the loadings of the organoids (PCA only with organoids)

grouping_cohort_all_basic = c(natgen_metadata[[which_natgen]]$study, rep('organoids', nrow(org)))
grouping_cohort_all_basic[grouping_cohort_all_basic != "organoids"] = "primary"

With imputation of 0.01

With imputation of 0.01

s1 seems not to be of importance in the loadings of this last PCA. Use it as baseline for PCA (good, too, because it’s almost always strictly positive).

## Creating plot... it might take some time if the data are large. Number of samples: 18

PCA with imputation of 0.01, as in the dendrogram

PCA of ALR with imputation of 0.01, as in the dendrogram

## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

org_rename = org
rownames(org_rename) = gsub("Sample ", "PDO", rownames(org))
grid.arrange(createBarplot(cbind(org_rename[,c(1,3,5,7)], sum=rowSums(org_rename[,c(2,4,6)])), remove_labels = FALSE, order_labels = names(sort(org_rename[,1]))) + 
    labs(y='Exposure')+ ggtitle('Exposures for the organoids')+theme(axis.text.x = element_text(angle = 45))+theme(legend.position = "bottom"),
createBarplot(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)]))), remove_labels = FALSE, order_labels = names(sort(org_rename[,1]))) + 
    labs(y='Exposure')+ ggtitle('Exposures for the organoids')+theme(axis.text.x = element_text(angle = 45))+theme(legend.position = "bottom"), nrow=1)

## Creating plot... it might take some time if the data are large. Number of samples: 18
## Creating plot... it might take some time if the data are large. Number of samples: 18
grid.arrange(give_pca(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)]))), names=rownames(org_rename)),
             give_pca(as(compositions::clr(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)])))), 'matrix'), names=rownames(org_rename)),
             give_pca(as(compositions::clr(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)])))), 'matrix'), names=rownames(org_rename),
                      give_loadings = T), nrow=1)

Note: s3 and s4 are consistently shown in the same PC axis, in opposite directions.

df_correlations_exposures_cohorts = data.frame(ALR_bsS1_s2=all_natgen[[which_natgen]][,2][1:nrow(natgen_metadata[[which_natgen]])]/all_natgen[[which_natgen]][,1][1:nrow(natgen_metadata[[which_natgen]])],
           ALR_bsS1_s4=all_natgen[[which_natgen]][,4][1:nrow(natgen_metadata[[which_natgen]])]/all_natgen[[which_natgen]][,1][1:nrow(natgen_metadata[[which_natgen]])],
          ALR_bsS1_s6=all_natgen[[which_natgen]][,6][1:nrow(natgen_metadata[[which_natgen]])]/all_natgen[[which_natgen]][,1][1:nrow(natgen_metadata[[which_natgen]])],
                  colour=natgen_metadata[[which_natgen]]$study[1:nrow(natgen_metadata[[which_natgen]])]
                  )
grid.arrange(ggplot(df_correlations_exposures_cohorts, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s4))+
               geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s4)'),
             ggplot(df_correlations_exposures_cohorts, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s6))+
               geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s6)'), nrow=1)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 33 rows containing non-finite values (stat_smooth).
## Warning: Removed 7 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 33 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).

Pretty interesting plot, as it seems to show four driving forces whicc are orthogonal two-on-two. n1 (s1) and s2 (s3) are anticorrelated and driving most of the variance in organoids. The sum of s2,4,6, is orthogonal to both s1 and s3, and causes most subsequent variation.

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: position_stack requires non-overlapping x intervals

Does the same happen in the cohorts?

give_pca(data_matrix = as(compositions::clr(normalise_rw(cbind(natgen[[which_natgen]][,c(1,3,7)],
                                                                sum=rowSums(natgen[[which_natgen]][,c(2,4,6)])))), 'matrix'),
         names=rownames(natgen[[which_natgen]]), give_loadings = T, print_labels = T, print_both_labels=F)

give_pca(data_matrix = as(compositions::clr(impute(normalise_rw(cbind(natgen[[which_natgen]][,c(1,3,7)],
                                                                sum=rowSums(natgen[[which_natgen]][,c(2,4,6)]))), 1e-2)), 'matrix'),
         names=rownames(natgen[[which_natgen]]), give_loadings = T, print_labels = T, print_both_labels=F)

4.0.1 Some comments on the PCAs

  • Whenever an imputation for zero values is used, a large fraction of organoid samples appear as outliers, and so do the other TCGA samples (which are the other samples that contain zeros)
  • If the robust CLR PCA is used, there is no such distinction
  • If the raw values are used (which I think there should be no problem using), there is no big difference either.

4.1 Hierarchical clustering

4.2 Plot showing differences in amalgamation ratios

par(mfrow=c(1,2))
dendroraw = give_dendrogram_generalised(all_natgen[[which_natgen]], modify_labels = F, keep_only_PDO = T)
dendroclr = give_dendrogram_generalised(as(compositions::clr(all_natgen[[which_natgen]]), 'matrix'), modify_labels = F, keep_only_PDO=T)

par(mfrow=c(1,4), mar=c(0.2,0.2,0.2,0.2))
dendroraw_pdo = give_dendrogram_generalised(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], modify_labels=F, keep_only_PDO = T)
dendroclr_pdo = give_dendrogram_generalised(as(compositions::clr(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),]), 'matrix'), modify_labels=F, keep_only_PDO = T)
dendroimput_pdo = give_dendrogram_generalised(impute(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], 1e-2), modify_labels=F, keep_only_PDO = T)
dendroimputclr_pdo = give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = T)

dendroimputclr_all = give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = F)

Number of samples for this clustering analysis:

dim(all_natgen[[which_natgen]])
## [1] 710   7
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## quartz_off_screen 
##                 2

Now doing the same with a different imputation value

cluster_clades <- cutree(dendroimputclr_all, k=2)[match(rownames(all_natgen[[which_natgen]]),
                                                      names(cutree(dendroimputclr_all, k=2)))]

saveRDS(cluster_clades, "robjects/cluster_clades_imput002clr.RDS")
<<<<<<< HEAD

=======
>>>>>>> ce0a7d8a1293685287cc1c5d453f51696f0478b0
give_pca(as(compositions::clr(impute(all_natgen[[which_natgen]], 1e-2)), 'matrix'), title='imputation+CLR, centered',
         names=c(ifelse(grepl('PDO', rownames(all_natgen[[which_natgen]])),
                        yes=rownames(all_natgen[[which_natgen]]),
                        no=NA)),
         give_loadings = T,
         print_labels=T,
         groups=factor(cluster_clades),
         group_is_factor=F, groups_shape=NULL,
         nrow_legend=1)+
  theme_bw()
## Warning: Removed 692 rows containing missing values (geom_label_repel).

give_pca(as(compositions::clr(impute(all_natgen[[which_natgen]], 1e-3)), 'matrix'), title='imputation+CLR, centered',
         names=c(ifelse(grepl('PDO', rownames(all_natgen[[which_natgen]])),
                        yes=rownames(all_natgen[[which_natgen]]),
                        no=NA)),
         give_loadings = T,
         print_labels=T,
         groups=factor(cutree(dendroimputclr_all, k=2)[match(rownames(all_natgen[[which_natgen]]),
                                                      names(cutree(dendroimputclr_all, k=2)))]),
         group_is_factor=F, groups_shape=NULL,
         nrow_legend=1)+
  theme_bw()
## Warning: Removed 692 rows containing missing values (geom_label_repel).

4.2.1 Co-occurrence of signatures

pairs( (org/org[,1])[,-1], main='ALR with S1')

pairs( (org/org[,2])[,-2], main='ALR with S2')

grid.arrange(ggplot(cbind.data.frame(s2=org[,'s2'], s4=org[,'s4'], lab=rownames(org)), aes(x=s2, y=s4, label=lab))+geom_point()+geom_label_repel(),
ggplot(cbind.data.frame(s2s1=org[,'s2']/org[,'s1'], s4s1=org[,'s4']/org[,'s1'], lab=rownames(org)), aes(x=s2s1, y=s4s1, label=lab))+geom_point()+geom_label_repel(), nrow=1)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

pairs( (all_natgen[[which_natgen]]/all_natgen[[which_natgen]][,1])[,-1], main='exp ALR with s1 as baseline')

pairs( (all_natgen[[which_natgen]]/all_natgen[[which_natgen]][,2])[,-2], main='exp ALR with s2 as baseline')

pairs( (impute(all_natgen[[which_natgen]], 1e-2)/impute(all_natgen[[which_natgen]], 1e-2)[,2])[,-2], main='ALR with s2 as baseline with imputation 1e-2')

alr_s2_natgen_imput002 <- exp(as(compositions::alr(impute(all_natgen[[which_natgen]], 1e-2), ivar = 2), 'matrix'))
alr_s2_natgen <- exp(as(compositions::alr(all_natgen[[which_natgen]], ivar = 2), 'matrix'))
alr_s1_natgen <- exp(as(compositions::alr(all_natgen[[which_natgen]], ivar = 1), 'matrix'))
pairs( alr_s2_natgen_imput002,
       main='exp ALR with s2 as baseline with imputation 1e-2 (compositions package)')

head(cbind.data.frame(alr_s1_natgen, cluster_clades))
##                      s2        s3        s4         s5        s6         s7
## TCGA-04-1331 0.00000000 0.0000000 0.4675971 0.08074177 0.0000000  0.3958399
## TCGA-04-1332 0.12787207 0.0000000 0.3178997 0.24563898 0.2006571  0.7096114
## TCGA-04-1335 0.00000000 0.1892053 0.0000000 0.01740368 0.0000000  0.1488342
## TCGA-04-1336 0.00000000 0.0000000 0.5955696 0.00000000 0.0000000  0.2405088
## TCGA-04-1337 0.07024971 0.0000000 0.1129509 0.00000000 0.0000000  0.0000000
## TCGA-04-1338 8.22150432 2.4660602 2.6013353 2.04572519 2.3362734 28.7330851
##              cluster_clades
## TCGA-04-1331              1
## TCGA-04-1332              1
## TCGA-04-1335              2
## TCGA-04-1336              1
## TCGA-04-1337              2
## TCGA-04-1338              1
ggplot(data.frame(alr_s2_natgen), aes(x=s3, y=s4))+geom_point()+theme_bw()
## Warning: Removed 180 rows containing missing values (geom_point).

head(data.frame(alr_s1_natgen))
##                      s2        s3        s4         s5        s6         s7
## TCGA-04-1331 0.00000000 0.0000000 0.4675971 0.08074177 0.0000000  0.3958399
## TCGA-04-1332 0.12787207 0.0000000 0.3178997 0.24563898 0.2006571  0.7096114
## TCGA-04-1335 0.00000000 0.1892053 0.0000000 0.01740368 0.0000000  0.1488342
## TCGA-04-1336 0.00000000 0.0000000 0.5955696 0.00000000 0.0000000  0.2405088
## TCGA-04-1337 0.07024971 0.0000000 0.1129509 0.00000000 0.0000000  0.0000000
## TCGA-04-1338 8.22150432 2.4660602 2.6013353 2.04572519 2.3362734 28.7330851
ggplot(cbind.data.frame(alr_s1_natgen, cluster_clades),
       aes(x=s4, y=s7, col=cluster_clades))+
  geom_point(alpha=0.2)+facet_wrap(.~cluster_clades)+theme_bw()+guides(col=F)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: Removed 1 rows containing missing values (geom_point).

ggplot(cbind.data.frame(alr_s2_natgen, cluster_clades),
       aes(x=s1, y=s3, col=cluster_clades))+
  geom_point(alpha=0.2)+facet_wrap(.~cluster_clades)+theme_bw()+guides(col=F)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: Removed 117 rows containing missing values (geom_point).

The fraction of zero exposures in the s3-rich clade is - lower in s1 (stat.signif.; p-value = 3.159e-07) - lower in s3 (stat.signif.; p-value < 2.2e-16) - higher in s4 (stat.signif.; p-value < 2.2e-16) - higher in s6 (stat.signif.; but marginally) - higher in s7 (stat.signif.; p-value = 1.409e-05)

zeros_in_clades <- lapply(split(f = cluster_clades,
                                x = cbind.data.frame(all_natgen[[which_natgen]])),
                          function(i) colSums(i==0))
zeros_in_clades
## $`1`
##  s1  s2  s3  s4  s5  s6  s7 
##  34 151 214   7  44 102   3 
## 
## $`2`
##  s1  s2  s3  s4  s5  s6  s7 
##   0  96  15 134  18  83  16
cluster_clades_size <- table(cluster_clades)
zeros_in_clades[[1]]/cluster_clades_size[1]
##          s1          s2          s3          s4          s5          s6 
## 0.074235808 0.329694323 0.467248908 0.015283843 0.096069869 0.222707424 
##          s7 
## 0.006550218
zeros_in_clades[[2]]/cluster_clades_size[2]
##         s1         s2         s3         s4         s5         s6         s7 
## 0.00000000 0.38095238 0.05952381 0.53174603 0.07142857 0.32936508 0.06349206
lapply(1:7, function(i) fisher.test(matrix(c(zeros_in_clades[[1]][i],
                                             cluster_clades_size[1]-zeros_in_clades[[1]][i],
                                             zeros_in_clades[[2]][i],
                                             cluster_clades_size[2]-zeros_in_clades[[2]][i]),
                                           ncol=2 )))
## [[1]]
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value = 3.159e-07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  5.118443      Inf
## sample estimates:
## odds ratio 
##        Inf 
## 
## 
## [[2]]
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value = 0.1878
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.5733129 1.1165359
## sample estimates:
## odds ratio 
##  0.7995343 
## 
## 
## [[3]]
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   7.891665 25.872815
## sample estimates:
## odds ratio 
##   13.81518 
## 
## 
## [[4]]
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.005273055 0.030204872
## sample estimates:
## odds ratio 
## 0.01376754 
## 
## 
## [[5]]
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value = 0.3308
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7611102 2.6011438
## sample estimates:
## odds ratio 
##   1.381039 
## 
## 
## [[6]]
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value = 0.002358
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.4085098 0.8351804
## sample estimates:
## odds ratio 
##  0.5838318 
## 
## 
## [[7]]
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value = 1.409e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.01805523 0.34568499
## sample estimates:
## odds ratio 
## 0.09753283
pairs( (normalise_rw(all_natgen[[which_natgen]][,-5])/all_natgen[[which_natgen]][,1])[,-1], main='ALR with s1 as baseline, without s5')

sum(grepl('PDO', rownames(all_natgen[[which_natgen]])))
## [1] 18
sum(grepl('org', rownames(all_natgen[[which_natgen]])))
## [1] 0
which_natgen
## [1] "UpdatedExposures"
all_natgen_bool <- all_natgen[[which_natgen]] == 0
all_natgen_without_orgs <- all_natgen[[which_natgen]][!grepl('PDO', rownames(all_natgen[[which_natgen]])),]
all_natgen_without_orgs_imput_alr <- as(compositions::alr(normalise_rw(impute(all_natgen_without_orgs, 1e-2) ), ivar = 1), 'matrix')
orgs_imput_alr <- as(compositions::alr(normalise_rw(impute(org, 1e-2) ), ivar = 1), 'matrix')
orgs_bool <- org == 0
dim(all_natgen_without_orgs_imput_alr)
## [1] 692   6
all_natgen_bool_without_orgs <- all_natgen_bool[!grepl('PDO', rownames(all_natgen[[which_natgen]])),]
cooc_zeros <- outer(1:7, 1:7, Vectorize(function(i,j){
  tab <- (all_natgen_bool_without_orgs[,i] == all_natgen_bool_without_orgs[,j])
  sum(tab)/nrow(all_natgen_bool)
  }))
cooc_zeros_org <- outer(1:7, 1:7, Vectorize(function(i,j){
  tab <- (orgs_bool[,i] == orgs_bool[,j])
  sum(tab)/nrow(orgs_bool)
  }))
pearson_imput <- outer(1:7, 1:7, Vectorize(function(i,j){
  # mean(all_natgen_without_orgs[,i]+1e-2) - mean(all_natgen_without_orgs[,j]+1e-2)
  stats::cor(x = normalise_rw(all_natgen_without_orgs[,i]+1e-2 ),
             y = normalise_rw(all_natgen_without_orgs[,j]+1e-2))
  }))
pearson_imput_alrs1 <- outer(1:6, 1:6, Vectorize(function(i,j){
  stats::cor(x = all_natgen_without_orgs_imput_alr[,i],
             y = all_natgen_without_orgs_imput_alr[,j])
  }))
pearson_imput_alrs1_orgs <- outer(1:6, 1:6, Vectorize(function(i,j){
  stats::cor(x = orgs_imput_alr[,i],
             y = orgs_imput_alr[,j])
  }))
pearson_imput
##             [,1]        [,2]        [,3]         [,4]        [,5]       [,6]
## [1,]  1.00000000 -0.38969741  0.02715551 -0.363719493 -0.32854049 -0.2413427
## [2,] -0.38969741  1.00000000  0.05049182 -0.050973286 -0.02547167  0.1278399
## [3,]  0.02715551  0.05049182  1.00000000 -0.510074854  0.09819571 -0.2462266
## [4,] -0.36371949 -0.05097329 -0.51007485  1.000000000 -0.25268794  0.1469121
## [5,] -0.32854049 -0.02547167  0.09819571 -0.252687937  1.00000000 -0.1509271
## [6,] -0.24134265  0.12783991 -0.24622663  0.146912135 -0.15092710  1.0000000
## [7,] -0.47633404 -0.15723314 -0.24460908 -0.003010673  0.18689540 -0.2303376
##              [,7]
## [1,] -0.476334044
## [2,] -0.157233143
## [3,] -0.244609078
## [4,] -0.003010673
## [5,]  0.186895401
## [6,] -0.230337551
## [7,]  1.000000000
rownames(cooc_zeros) <- colnames(cooc_zeros) <- rownames(cooc_zeros_org) <- colnames(cooc_zeros_org) <- paste0('s', 1:7)
rownames(pearson_imput) <- colnames(pearson_imput) <- paste0('s', 1:7)
rownames(pearson_imput_alrs1) <- colnames(pearson_imput_alrs1) <- rownames(pearson_imput_alrs1_orgs) <- colnames(pearson_imput_alrs1_orgs) <- paste0('ALR_s1(', 1:6, ')')
pheatmap(cooc_zeros, main = "Co-occurrence of zeros", cluster_r = F, cluster_c = F)

pheatmap(cooc_zeros_org, main = "Orgs Co-occurrence of zeros", cluster_r = F, cluster_c = F)

pheatmap(pearson_imput, main = "Pearson correlation of imputated activities (prob)", cluster_r = F, cluster_c = F)

pheatmap(pearson_imput_alrs1, main = "Pearson correlation of imputated activities (ALR s1)", cluster_r = F, cluster_c = F)

pheatmap(pearson_imput_alrs1_orgs, main = "Orgs Pearson correlation of imputated activities (ALR s1)", cluster_r = F, cluster_c = F)

dim(all_natgen_bool)
## [1] 710   7
dim(all_natgen_bool_without_orgs)
## [1] 692   7
df_correlations_exposures_organoids = data.frame(ALR_bsS1_s2=org[,2]/org[,1],
                  ALR_bsS1_s4=org[,4]/org[,1],
                  ALR_bsS1_s6=org[,6]/org[,1])
grid.arrange(ggplot(df_correlations_exposures_organoids, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s4))+
               geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s4)'),
             ggplot(df_correlations_exposures_organoids, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s6))+
               geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s6)'), nrow=1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

4.3 Comparison of exposures between exposures clusterings, and with and RNA-Seq data

The imputation+clr dendrograms are not the same when considering only the organoids than when using all samples from TCGA, but in general the two agree in which organoids get clustered and the overal structure of the dendrogram.

# dendroimputclr_all = give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = F)
dendroimputclrimpute_org <- give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = T, plot_dendro=F)
hclust_rnaseq <- readRDS("../RNASeq_DE_resistant_sensitive/objects/hclust_rnaseq.RDS")
tanglegram(as.dendrogram (dendroimputclrimpute_org), as.dendrogram(dendroimputclr_all), main='PDO-only clustering vs all exposures', cex_main=1)
## Warning in intersect_trees(dend1, dend2, warn = TRUE): The labels in both tree
## had different values - trees were pruned.

tanglegram(as.dendrogram (dendroimputclrimpute_org), as.dendrogram(hclust_rnaseq), main='PDO-only clustering vs RNA-Seq clustering', cex_main=1)
## Warning in intersect_trees(dend1, dend2, warn = TRUE): The labels in both tree
## had different values - trees were pruned.

tanglegram(as.dendrogram (dendroimputclr_all), as.dendrogram(hclust_rnaseq), main='All exposures clustering vs RNA-Seq clustering', cex_main=1)
## Warning in intersect_trees(dend1, dend2, warn = TRUE): The labels in both tree
## had different values - trees were pruned.

Now with TPM

hclust_rnaseq_TPM <- readRDS("../RNASeq_DE_resistant_sensitive/objects/hclust_rnaseq_TPM.RDS")
tanglegram(as.dendrogram (dendroimputclrimpute_org), as.dendrogram(hclust_rnaseq_TPM), main='PDO-only clustering vs RNA-Seq TPM clustering', cex_main=1)
## Warning in intersect_trees(dend1, dend2, warn = TRUE): The labels in both tree
## had different values - trees were pruned.

tanglegram(as.dendrogram (dendroimputclr_all), as.dendrogram(hclust_rnaseq_TPM), main='All exposures clustering vs RNA-Seq TPM clustering', cex_main=1)
## Warning in intersect_trees(dend1, dend2, warn = TRUE): The labels in both tree
## had different values - trees were pruned.

tanglegram(as.dendrogram (hclust_rnaseq), as.dendrogram(hclust_rnaseq_TPM), main='RNA-Seq clustering vs RNA-Seq TPM clustering', cex_main=1)

5 Analysis of CN profiles

additional genomic data comparing the tumours to the organoids in terms of ploidy, number of rearrangements and any other things that you think could be relevant

5.1 Load the segments

5.1.1 Distribution of features

pcawg_CN_features = readRDS("data/pcawg_CN_features.rds")
tcga_CN_features = readRDS("data/tcga_CN_features.rds")

BriTROC_absolute_copynumber = readRDS("../../cnsignatures/manuscript_Rmarkdown/data/BriTROC_absolute_copynumber.rds")
BriTROC2_CN_features = readRDS("data/6_TCGA_Signatures_on_BRITROC/0_BRITROC_absolute_CN.rds")

organoids_absolute_copynumber = readRDS("data/organoid_absolute_CN.rds")
sampleNames(organoids_absolute_copynumber) = names_orgs$`new name`[match(gsub("org", "", sampleNames(organoids_absolute_copynumber)), names_orgs$`old name`)]
organoids_CN_features = extractCopynumberFeatures(organoids_absolute_copynumber)

BriTROC_CN_features = readRDS("data/BriTROC_CN_features.rds")

The number of segments can be taken either from segsize (first column) or from copynumber (last column). This is just for PCAWG and TCGA! Not for BriTROC. Any idea why this is the case?

** Note: I am plotting this as the log!**

** Note 2: I am pooling together segments from several samples and analysing them all together! I.e. I have not averaged over samples **

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## quartz_off_screen 
##                 2

5.1.2 Tests for equality of means of the distribution of features (average of segments within a sample)

  • I have computed the average value for each feature for each sample. Then I have compared the distribution of means of the patient cohorts all together, versus the distribution of means of the organoids
  • This is different from the approach above, where we were pooling segments for all samples (i.e. not keeping patient information)
  • In all cases I have log-transformed the data, which ensues normality, to compute a p-value for the difference in means
  • I am using the default r t-test, which is the Welch t-test, which does not assume equality of variances
  • In all cases there is no statistical significance between organoids the cohorts, indicating a similar distribution (at least when it comes to their means)
  • The only feature which is truly problematic is copy number, which very evidently shows a bimodal distribution (since there are the normal segments, or segments of copy number 1, and the amplified segments, which appear as two populations)
distrib_segsize = create_distrib_df('segsize', 'value')
distrib_bp10MB = create_distrib_df('bp10MB', 'value')
distrib_osCN = create_distrib_df('osCN', 'value')
distrib_bpchrarm = create_distrib_df('bpchrarm', 'ct1')
distrib_changepoint = create_distrib_df('changepoint', 'value')
distrib_copynumber = create_distrib_df('copynumber', 'value')

#------------------------------------------------------------------------------------------------#

## 1/6 breakpoints per 10MB
grid.arrange(give_joint_histogram(list(distrib_bp10MB[['org']]$`mean(ct1_num)`,
                          distrib_bp10MB[['pcawg']]$`mean(ct1_num)`,
                          distrib_bp10MB[['tcga']]$`mean(ct1_num)`,
                          distrib_bp10MB[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE),
give_joint_histogram(list(distrib_bp10MB[['org']]$`mean(ct1_num)` %>% log,
                          distrib_bp10MB[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_bp10MB[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_bp10MB[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)

t.test(distrib_bp10MB[['org']]$`mean(ct1_num)` %>% log,
       c(distrib_bp10MB[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_bp10MB[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_bp10MB[['BriTROC']]$`mean(ct1_num)` %>% log))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_bp10MB[["org"]]$`mean(ct1_num)` %>% log and c(distrib_bp10MB[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_bp10MB[["tcga"]]$`mean(ct1_num)` %>% log, distrib_bp10MB[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = -0.52491, df = 18.513, p-value = 0.6059
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3415468  0.2047773
## sample estimates:
##  mean of x  mean of y 
## -0.8415706 -0.7731859
#------------------------------------------------------------------------------------------------#
## 2/6 segment size
grid.arrange(give_joint_histogram(list(distrib_segsize[['org']]$`mean(ct1_num)`,
                          distrib_segsize[['pcawg']]$`mean(ct1_num)`,
                          distrib_segsize[['tcga']]$`mean(ct1_num)`,
                          distrib_segsize[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_segsize[['org']]$`mean(ct1_num)` %>% log,
                          distrib_segsize[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_segsize[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_segsize[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)

t.test(distrib_segsize[['org']]$`mean(ct1_num)` %>% log,
       c(distrib_segsize[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_segsize[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_segsize[['BriTROC']]$`mean(ct1_num)` %>% log))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_segsize[["org"]]$`mean(ct1_num)` %>% log and c(distrib_segsize[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_segsize[["tcga"]]$`mean(ct1_num)` %>% log, distrib_segsize[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = 0.77176, df = 18.338, p-value = 0.4501
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1435866  0.3106786
## sample estimates:
## mean of x mean of y 
##  16.69896  16.61541
#------------------------------------------------------------------------------------------------#
## 3/6 oscillating CN
grid.arrange(give_joint_histogram(list(distrib_osCN[['org']]$`mean(ct1_num)`,
                          distrib_osCN[['pcawg']]$`mean(ct1_num)`,
                          distrib_osCN[['tcga']]$`mean(ct1_num)`,
                          distrib_osCN[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_osCN[['org']]$`mean(ct1_num)` %>% log,
                          distrib_osCN[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_osCN[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_osCN[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)
## Warning: Removed 15 rows containing non-finite values (stat_bin).

t.test(distrib_osCN[['org']]$`mean(ct1_num)` %>% log,
       remove_infty(c(distrib_osCN[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_osCN[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_osCN[['BriTROC']]$`mean(ct1_num)` %>% log)))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_osCN[["org"]]$`mean(ct1_num)` %>% log and remove_infty(c(distrib_osCN[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_osCN[["tcga"]]$`mean(ct1_num)` %>% log, distrib_osCN[["BriTROC"]]$`mean(ct1_num)` %>% log))
## t = 0.36511, df = 18.948, p-value = 0.7191
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1355887  0.1928764
## sample estimates:
##  mean of x  mean of y 
## -0.5555398 -0.5841836
#------------------------------------------------------------------------------------------------#
## 4/6 num of breakpoints per chromosome arm
grid.arrange(give_joint_histogram(list(distrib_bpchrarm[['org']]$`mean(ct1_num)`,
                          distrib_bpchrarm[['pcawg']]$`mean(ct1_num)`,
                          distrib_bpchrarm[['tcga']]$`mean(ct1_num)`,
                          distrib_bpchrarm[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_bpchrarm[['org']]$`mean(ct1_num)` %>% log,
                          distrib_bpchrarm[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_bpchrarm[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_bpchrarm[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)

## Same distribution of num of breakpoints per chromosome arm
t.test(distrib_bpchrarm[['org']]$`mean(ct1_num)` %>% log,
       c(distrib_bpchrarm[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_bpchrarm[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_bpchrarm[['BriTROC']]$`mean(ct1_num)` %>% log))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_bpchrarm[["org"]]$`mean(ct1_num)` %>% log and c(distrib_bpchrarm[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_bpchrarm[["tcga"]]$`mean(ct1_num)` %>% log, distrib_bpchrarm[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = -0.53247, df = 18.515, p-value = 0.6007
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3425379  0.2037960
## sample estimates:
## mean of x mean of y 
##  1.078022  1.147393
#------------------------------------------------------------------------------------------------#
## 5/6 num of changepoints
grid.arrange(give_joint_histogram(list(distrib_changepoint[['org']]$`mean(ct1_num)`,
                          distrib_changepoint[['pcawg']]$`mean(ct1_num)`,
                          distrib_changepoint[['tcga']]$`mean(ct1_num)`,
                          distrib_changepoint[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_changepoint[['org']]$`mean(ct1_num)` %>% log,
                          distrib_changepoint[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_changepoint[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_changepoint[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)

## Same distribution of num of changepoints
t.test(distrib_changepoint[['org']]$`mean(ct1_num)` %>% log,
       c(distrib_changepoint[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_changepoint[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_changepoint[['BriTROC']]$`mean(ct1_num)` %>% log))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_changepoint[["org"]]$`mean(ct1_num)` %>% log and c(distrib_changepoint[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_changepoint[["tcga"]]$`mean(ct1_num)` %>% log, distrib_changepoint[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = 0.19539, df = 18.554, p-value = 0.8472
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1691729  0.2039479
## sample estimates:
## mean of x mean of y 
## 0.4200699 0.4026824
#------------------------------------------------------------------------------------------------#
## 6/6 copy number
grid.arrange(give_joint_histogram(list(distrib_copynumber[['org']]$`mean(ct1_num)`,
                          distrib_copynumber[['pcawg']]$`mean(ct1_num)`,
                          distrib_copynumber[['tcga']]$`mean(ct1_num)`,
                          distrib_copynumber[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_copynumber[['org']]$`mean(ct1_num)` %>% log,
                          distrib_copynumber[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_copynumber[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_copynumber[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)

## Same distribution of copy number of segments
t.test(distrib_copynumber[['org']]$`mean(ct1_num)` %>% log,
       c(distrib_copynumber[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_copynumber[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_copynumber[['BriTROC']]$`mean(ct1_num)` %>% log))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_copynumber[["org"]]$`mean(ct1_num)` %>% log and c(distrib_copynumber[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_copynumber[["tcga"]]$`mean(ct1_num)` %>% log, distrib_copynumber[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = 0.35387, df = 18.18, p-value = 0.7275
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1286706  0.1808402
## sample estimates:
## mean of x mean of y 
##  1.160435  1.134350

Explanation: The distribution of sample-averaged values of features are the sample between organoids and the cohorts of primary tissue samples for the number of breakpoints per 10MB (Welch Two Sample t-test on log-transformed data, p-value = 0.6059), segment size (Welch Two Sample t-test on log-transformed data, p-value = 0.4501 ), oscillating copy number (Welch Two Sample t-test on log-transformed data, p-value = 0.7191), number of breakpoints per chromosome arm (Welch Two Sample t-test on log-transformed data, p-value = 0.6007), number of changepoints (Welch Two Sample t-test on log-transformed data, p-value = 0.8472), and the copy number of the segments (Welch Two Sample t-test on log-transformed data, p-value = 0.7275).

5.2 Number of segments; Poisson and NegBin GLM

TL;DR with a negative binomial model, which is much more appropriate in this setting than a Poisson, there is no difference in the distributions of organoids and non-organodis when it comes to number of segments.

## Analysis of Deviance Table
## 
## Model 1: length ~ 1
## Model 2: length ~ bool
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       824      58116                          
## 2       823      58024  1   91.318 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in anova.negbin(...): only Chi-squared LR tests are implemented
## Likelihood ratio tests of Negative Binomial Models
## 
## Response: length
##   Model    theta Resid. df    2 x log-lik.   Test    df LR stat.   Pr(Chi)
## 1     1 3.076226       824       -9971.807                                
## 2  bool 3.081258       823       -9970.344 1 vs 2     1  1.46259 0.2265185
sapply(list(glm_poisson_length[glm_poisson_length$names == 'organoids','length']), function(i) c(mean(i), sd(i)))
##          [,1]
## [1,] 169.0556
## [2,]  77.4866
sapply(list(glm_poisson_length[glm_poisson_length$names != 'organoids','length']), function(i) c(mean(i), sd(i)))
##          [,1]
## [1,] 200.4002
## [2,] 134.0146

Unfortunately the scaling factor has to do with the width of the bins in the histogram.

5.3 Ploidy

To get the ploidy, I just have to compute the weighted average of the copy number segments (this is computed from the absolute copy number profiles objects, since they specify, for each segment, its length and its ploidy).

Use getSegTable to get the segments from this Biobase file

Needs to be ammended: it has to also count areas of the genome for which we don’t have segments! i.e. we need to know the effective genome size

## we only want the ovarian ones
ICGC_absolute_copynumber_AU = readRDS("data/CN_Calls_ABSOLUTE_PCAWG/OV-AU.segments.raw.rds")
ICGC_absolute_copynumber_US = readRDS("data/CN_Calls_ABSOLUTE_PCAWG/OV-US.segments.raw.rds")
ICGC_absolute_copynumber_AU = ICGC_absolute_copynumber_AU[,c('sample', 'chr', 'startpos', 'endpos', 'segVal')]
ICGC_absolute_copynumber_US = ICGC_absolute_copynumber_US[,c('sample', 'chr', 'startpos', 'endpos', 'segVal')]

segtables_ICGC_absolute_copynumber_AU = lapply(sort(unique(ICGC_absolute_copynumber_AU$sample)),
                                            function(samplename)
                                              ICGC_absolute_copynumber_AU[ICGC_absolute_copynumber_AU$sample == samplename,])
segtables_ICGC_absolute_copynumber_AU = lapply(segtables_ICGC_absolute_copynumber_AU, function(i) { colnames(i)[colnames(i) == "chr"] = "chromosome";
colnames(i)[colnames(i) == "endpos"] = "end";
return(i) } )
names(segtables_ICGC_absolute_copynumber_AU) = unique(ICGC_absolute_copynumber_AU$sample)

segtables_ICGC_absolute_copynumber_US = lapply(sort(unique(ICGC_absolute_copynumber_US$sample)),
                                            function(samplename) ICGC_absolute_copynumber_US[ICGC_absolute_copynumber_US$sample == samplename,])
segtables_ICGC_absolute_copynumber_US = lapply(segtables_ICGC_absolute_copynumber_US, function(i) { colnames(i)[colnames(i) == "chr"] = "chromosome";
colnames(i)[colnames(i) == "endpos"] = "end";
return(i) } )
names(segtables_ICGC_absolute_copynumber_US) = unique(ICGC_absolute_copynumber_US$sample)

## for ICGC, remove the samples row and put it in the rows
segtables_ICGC_absolute_copynumber_US = lapply(segtables_ICGC_absolute_copynumber_US, function(i){
  rownames(i) = i$samples
  i = i[,-1]
  i})
segtables_ICGC_absolute_copynumber_AU = lapply(segtables_ICGC_absolute_copynumber_AU, function(i){
  rownames(i) = i$samples
  i = i[,-1]
  i})
## Check that there are no sex chromosomes included anywhere
## [1] TRUE TRUE TRUE TRUE TRUE
system("mkdir -p data/clean_segtables/")
saveRDS(segtables_BriTROC_absolute_copynumber, "data/clean_segtables/segtables_BriTROC_absolute_copynumber.RDS")
saveRDS(segtables_ICGC_absolute_copynumber_US, "data/clean_segtables/segtables_ICGC_absolute_copynumber_US.RDS")
saveRDS(segtables_ICGC_absolute_copynumber_AU, "data/clean_segtables/segtables_ICGC_absolute_copynumber_AU.RDS")
saveRDS(segtables_TCGA_absolute_copynumber, "data/clean_segtables/segtables_TCGA_absolute_copynumber.RDS")

In the previous computation of the sample ploidy (ploidy_ICGC_US_previous, etc.) I had not included normal segments. Now I do but actually the result is very similar. (Note slghtly under- and over-estimated values at the right and left of ploidy=2).

plot(log(c(ploidy_ICGC_US_previous, ploidy_ICGC_AU_previous, ploidy_TCGA_previous,
           ploidy_BriTROC_previous, ploidy_organoids_previous)),
     log(c(ploidy_ICGC_US, ploidy_ICGC_AU, ploidy_TCGA, ploidy_BriTROC, ploidy_organoids)),
     ylab='log ploidy of samples with current mean ploidy calculation',
     xlab='log ploidy of samples with previous mean ploidy calculation')
abline(coef=c(0,1), lty='dashed')
abline(v=log(2), lty='dashed')

Ploidy is not normally distributed and it’s right-skewed. Moreover, the distribution is bimodal: I guess there are genomes in which there is a clear amplification and genomes which are more or less normal, so centered around 2.

I am also using a robust linear regression, but I don’t think this is suitable either.

t.test(log(ploidy_organoids), log(ploidy_BriTROC))
## 
##  Welch Two Sample t-test
## 
## data:  log(ploidy_organoids) and log(ploidy_BriTROC)
## t = -0.90962, df = 20.948, p-value = 0.3734
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.16268846  0.06368736
## sample estimates:
## mean of x mean of y 
## 0.9401143 0.9896149
MASS::rlm(ploidy~group,
         data=cbind.data.frame(ploidy=c(ploidy_organoids, ploidy_BriTROC), group=c(rep(1,length(ploidy_organoids)), rep(2, length(ploidy_BriTROC)))))
## Call:
## rlm(formula = ploidy ~ group, data = cbind.data.frame(ploidy = c(ploidy_organoids, 
##     ploidy_BriTROC), group = c(rep(1, length(ploidy_organoids)), 
##     rep(2, length(ploidy_BriTROC)))))
## Converged in 5 iterations
## 
## Coefficients:
## (Intercept)       group 
##   2.4791559   0.1280964 
## 
## Degrees of freedom: 298 total; 296 residual
## Scale estimate: 0.675

5.4 Segments across the genome

## Segments across the genome
# (sapply(chrlen$V1, function(i) gsub("chr", "", i)))
sorted_chroms = chrlen$V1[order(as.numeric((gsub("chr", "", chrlen$V1))))]
## Warning in eval(quote(list(...)), env): NAs introduced by coercion
chrom_lenths = chrlen[match(sorted_chroms, chrlen$V1),]

5.5 Ploidy

5.6 Ranking for the number of copy number events and ploidy

## quartz_off_screen 
##                 2
## quartz_off_screen 
##                 2

6 Other

7 {r, print_organoids_absolute_copynumber} # head(organoids_absolute_copynumber) #

Amplifications/deletions of selected genes in organoids

##    PDO16    PDO15     PDO3     PDO9     PDO5     PDO6    PDO10     PDO1 
## 1.752252 2.706511 2.921865 2.825767 2.617572 2.639571 2.886007 3.659920 
##    PDO12     PDO4     PDO2    PDO18     PDO7    PDO17     PDO8    PDO13 
## 1.865682 2.818707 2.595914 2.884280 1.719599 2.864533 1.715556 2.753563 
##    PDO14    PDO11 
## 2.805061 3.042383

How come the segments include all the genome?

full_GR_example <- c(GRanges_chroms, as(rename_cols(data.frame(segtables_ICGC_absolute_copynumber_US[[1]])), "GRanges"))
granges_example <- GenomicRanges::disjoin(full_GR_example, with.revmap=TRUE, ignore.strand=TRUE)
diploid_segments = granges_example[sapply(granges_example$revmap, function(i) !any(i > length(GRanges_chroms))),]
diploid_segments
## GRanges object with 68 ranges and 1 metadata column:
##        seqnames              ranges strand |        revmap
##           <Rle>           <IRanges>  <Rle> | <IntegerList>
##    [1]        1     8799389-8863518      * |             1
##    [2]        1 121352501-128899999      * |             1
##    [3]        1 247249601-249250621      * |             1
##    [4]        2             1-20015      * |             2
##    [5]        2 242985494-243199373      * |             2
##    ...      ...                 ...    ... .           ...
##   [64]       20   62615607-63025520      * |            19
##   [65]       21          1-14413101      * |            22
##   [66]       21   46937501-48129895      * |            22
##   [67]       22          1-12199999      * |            21
##   [68]       22   49687501-51304566      * |            21
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
length(diploid_segments)
## [1] 68
chrlen
##       V1        V2
## 1   chr1 249250621
## 2   chr2 243199373
## 3   chr3 198022430
## 4   chr4 191154276
## 5   chr5 180915260
## 6   chr6 171115067
## 7   chr7 159138663
## 8   chrX 155270560
## 9   chr8 146364022
## 10  chr9 141213431
## 11 chr10 135534747
## 12 chr11 135006516
## 13 chr12 133851895
## 14 chr13 115169878
## 15 chr14 107349540
## 16 chr15 102531392
## 17 chr16  90354753
## 18 chr17  81195210
## 19 chr18  78077248
## 20 chr20  63025520
## 21  chrY  59373566
## 22 chr19  59128983
## 23 chr22  51304566
## 24 chr21  48129895
dim(natgen[[which_natgen]])
## [1] 692   7

7.1 Correlations between signatures and features

as.vector(number_of_segments$organoids[match(gsub('PDO', '', names(number_of_segments$organoids)), gsub('PDO ', '', rownames(org)))])
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# plot(as.vector(number_of_segments$organoids[match(gsub('PDO', '', names(number_of_segments$organoids)), gsub('PDO ', '', rownames(org)))]),
#      org[,'s6'])
# plot(as.vector(number_of_segments$organoids[match(gsub('PDO', '', names(number_of_segments$organoids)), gsub('PDO ', '', rownames(org)))]),
#      org[,'s6']/org[,'s1'])

7.2 Ascites

7.3 Observations about specific orgnaoids

sort(number_of_segments$organoids)
## 
##  PDO1 PDO11 PDO18 PDO16  PDO7 PDO12  PDO8 PDO13 PDO17  PDO2  PDO6  PDO5 PDO10 
##    64    64   113   118   120   130   137   149   155   165   174   176   178 
##  PDO3  PDO9 PDO15 PDO14  PDO4 
##   202   205   212   288   393
  • PDO4 (tandem dup) is the organoid with the highest number of segments (393)
  • The WGD sample, PDO1, is the one with the lowest number of segments, but highest ploidy
## MYC: chr8:128,747,680-128,755,197
lapply(names(segtables_organoids_absolute_copynumber), function(name_org){
  give_plot_areas_of_interest(segtab=segtables_organoids_absolute_copynumber[[name_org]],
                              chrom_AOI=8, start_AOI=118747680,
                              end_AOI=118755197,
                              subclonal_line1=2, subclonal_line2=3, title=paste0('MYC ', name_org))
})
## [[1]]
## [[1]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,52   1.95097
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[1]][[2]]

## 
## 
## [[2]]
## [[2]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |         1,104   8.89182
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[2]][[2]]

## 
## 
## [[3]]
## [[3]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |         1,125   6.28908
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[3]][[2]]

## 
## 
## [[4]]
## [[4]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |         1,129   6.38458
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[4]][[2]]

## 
## 
## [[5]]
## [[5]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,72   3.90838
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[5]][[2]]

## 
## 
## [[6]]
## [[6]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,75   4.13676
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[6]][[2]]

## 
## 
## [[7]]
## [[7]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |         1,102   4.31683
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[7]][[2]]

## 
## 
## [[8]]
## [[8]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,32   9.61704
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[8]][[2]]

## 
## 
## [[9]]
## [[9]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,69   2.02661
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[9]][[2]]

## 
## 
## [[10]]
## [[10]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |         1,228   3.05402
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[10]][[2]]

## 
## 
## [[11]]
## [[11]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,81   2.05615
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[11]][[2]]

## 
## 
## [[12]]
## [[12]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,56   3.06459
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[12]][[2]]

## 
## 
## [[13]]
## [[13]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,39   2.65679
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[13]][[2]]

## 
## 
## [[14]]
## [[14]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,82   3.95907
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[14]][[2]]

## 
## 
## [[15]]
## [[15]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,49   2.97383
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[15]][[2]]

## 
## 
## [[16]]
## [[16]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,76    3.9072
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[16]][[2]]

## 
## 
## [[17]]
## [[17]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |         1,148   4.97256
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[17]][[2]]

## 
## 
## [[18]]
## [[18]][[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |        revmap     CNval
##          <Rle>           <IRanges>  <Rle> | <IntegerList> <numeric>
##   [1]        8 118747680-118755197      * |          1,28   5.91658
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[18]][[2]]

sapply(plts_AOI_subclonal, `[`, 1)
<<<<<<< HEAD
saveRDS(natgen[[2]][grepl('TCGA', rownames(natgen[[2]])),],
        file = '~/Desktop/TCGA_exposures_12112019.RDS')
======= >>>>>>> ce0a7d8a1293685287cc1c5d453f51696f0478b0